Recitation 1


Exercise 1

  1. Students.
    The subjects are the "keys" of the table, so they should always be unique.
  2. Gender, Grade, City, Faculty, Family Income.
    The variables are just the fields of the table.
  3. Variable types:
    1. Gender: categorical (Binary, with two categories M, F)
    2. Grade: quantitative discrete
    3. City: categorical (Nominal, with categories Milan, Rome, Naples, Florence)
    4. Faculty: categorical (Nominal, with categories Economics, Statistics, Political Sciences)
    5. Family Income: quantitative continuous
Brief recap on variable types

Categorical (or Qualitative) variable

Each observation belongs to a category:

  • Binary: There are only 2 categories
  • Ordinal: The categories have a hierarchy or order.
  • Nominal: No hierarchy is present.

Quantitative variable

Observations take numerical values that represent different magnitudes of the variable:

  • Discrete: The possible values come from a set of separate numbers (for example, whole numbers)
  • Continuous: The values can be any number in an interval.

Exercise 2

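  1. A proportion is just the frequency of a category divided by the total number of observations: $\text{proportion} = \frac{\text{frequency}}{n}$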
  2. Pretty simple innit:

Exercise 3

  1. The variable here is the number of people in the family (the subjects are the families), and the frequency is how many families have that number of people:

Exercise 4

  1. This is getting boring:
  2. A histogram is a chart used to display numerical data:
Histogram vs Bar chart

The two main differences between a bar chart and a histogram are:

  • The bar chart displays categorical data, while the histogram displays quantitative data grouped into intervals
  • The bars of the bar chart are not adjacent to each other (there is a slight gap), while the bars of the histogram touch.
  3. The distribution is unimodal and skewed to the right.
Question

When we convert from continuous to intervals, what is the type of the new variable?

Exercise 5

  1. Each row is a data point, we treat IQ as the X variable and Salary as the Y variable, as we want to look at the correlation between Salary and IQ.

Exercise 6

  1. Mean, median and mode:
    1. Mean: $\bar{x} = \frac{101}{12} \approx 8.42$
    2. Median:
      First we sort the data: 1, 3, 5, 6, 7, 8, 9, 10, 10, 14, 14, 14.
      The number of values is even, so we have 2 middle values, 8 and 9.
      We take the mean of these two numbers: $\frac{8 + 9}{2} = 8.5$.
    3. Mode: 14 (it appears three times).
  2. Median > Mean, therefore the distribution is slightly skewed to the left:
  3. This question is so poorly written omfg, the new values are :
    1. Mean:
    2. Median: The new middle values are and , so the new median is .
    3. Mode: .
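
A quick way to double-check the first part of Exercise 6 is to let Python do the arithmetic. This is only a sketch on the original (unmodified) data listed in the median step; the variable names are my own.

```python
from statistics import mean, median, mode

# Original data of Exercise 6, as sorted in the median step.
data = [1, 3, 5, 6, 7, 8, 9, 10, 10, 14, 14, 14]

data_mean = mean(data)      # sum of the values divided by how many there are
data_median = median(data)  # with an even count: mean of the two middle values
data_mode = mode(data)      # the most frequent value

print(round(data_mean, 2), data_median, data_mode)
```

The median comes out larger than the mean, which matches the "slightly skewed to the left" remark.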

Exercise 7

  1. Mean, and median:
    1. Mean:
    2. Median:
  2. Skewed to the right, because Median < Mean.
  3. New
  4. New
  5. New
  6. Change in mean and median:
    1. New
    2. The median doesn't change at all, because the order of the values doesn't change and 400 wasn't in the middle anyway.

Recitation 2


Exercise 1

  1. The sample standard deviation is computed as follows:
  2. We are gonna do this later.
Why N - 1?

This is a non-technical explanation

Because the sample standard deviation is an estimate of the real (population) standard deviation, computed from a sample of the population.


The data points we get are more likely to fall around the mean and less likely to fall in the tails of the distribution. On top of that, we measure the deviations from the sample mean, which sits in the middle of our own sample by construction.


So if we divided by N, the result would tend to underestimate the real value.

For this reason we decrease the denominator to N - 1, which nudges the result upwards and compensates for the underestimate.
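
To see the effect of the denominator, here is a small sketch comparing the two versions. The sample is made up for illustration; `pstdev` divides by N and `stdev` divides by N - 1.

```python
from statistics import pstdev, stdev

# Hypothetical sample, chosen only so the numbers come out clean.
sample = [2, 4, 4, 4, 5, 5, 7, 9]

divide_by_n = pstdev(sample)          # sqrt(sum of squared deviations / N)
divide_by_n_minus_1 = stdev(sample)   # sqrt(sum of squared deviations / (N - 1))

print(divide_by_n, round(divide_by_n_minus_1, 3))
```

The N - 1 version is always the larger of the two, and the gap shrinks as the sample gets bigger.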

Exercise 3

  1. Mean and sample standard deviation:
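    1. Mean: $\bar{x} = \frac{711}{12} = 59.25$
    2. Sample standard deviation: $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} = \sqrt{\frac{2972.25}{11}} \approx 16.44$
  2. We have to split the observations into 4 subsets of equal size:
    1. First we sort the data: 30, 35, 53, 55, 57, 57, 60, 61, 64, 71, 78, 90.
    2. Then we split the sorted data into four groups: [30, 35, 53], [55, 57, 57], [60, 61, 64], [71, 78, 90]
    3. Then we compute the quartiles: $Q_1 = 54$, $Q_2 = 58.5$, $Q_3 = 67.5$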
Quartiles recap

The Quartiles split the distribution into four parts that have the same number of observations:



You can find the quartiles by:

  1. Ordering the set
  2. Splitting the set in 4 subsets
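  3. Taking the mean of the two values on either side of each split (the last value of one subset and the first value of the next)

Example:

[30, 35, 53], [55, 57, 57], [60, 61, 64], [71, 78, 90]

  • Q1 = (53 + 55) / 2 = 54
  • Q2 = (57 + 60) / 2 = 58.5
  • Q3 = (64 + 71) / 2 = 67.5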
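
The recap steps can be sketched in a few lines of Python. This follows the "split into 4 equal groups" method from the recap and assumes the number of observations is a multiple of 4; the function name is hypothetical, and software that interpolates between values may report slightly different quartiles.

```python
def quartiles_by_splitting(values):
    """Quartiles as the midpoints between four equal-sized sorted groups."""
    data = sorted(values)
    if len(data) == 0 or len(data) % 4 != 0:
        raise ValueError("this sketch needs a non-empty multiple of 4 values")
    size = len(data) // 4
    cuts = []
    for k in (1, 2, 3):
        last_of_group = data[k * size - 1]   # right edge of group k
        first_of_next = data[k * size]       # left edge of group k + 1
        cuts.append((last_of_group + first_of_next) / 2)
    return tuple(cuts)

q1, q2, q3 = quartiles_by_splitting([30, 35, 53, 55, 57, 57, 60, 61, 64, 71, 78, 90])
print(q1, q2, q3)
```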
  1. The mean, the quartiles, and the standard deviation all reduce by 5%:

    1. New ,
    2. New
    3. New ,
    4. New ,
    5. New .
  2. We do the same thing for some reason:

    1. New ,
    2. New ,
    3. New ,
    4. New ,
    5. New .
  3. Range and IQR.

    1. Range is just the max - min value.
    2. IQR (interquartile range) is the length of the interval [Q1, Q3], i.e. Q3 - Q1.
  4. Box plot:

Example

I can't do all the exercises, so from here on I just group them by topic.

Recitation 3


Regression line

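Imagine we need to find the line that best fits a scatter plot of the data:

We don't have to brute-force our way to the best-fitting line.
We already know that a line can be described with the following formula:

$$y = mx + q$$

  1. We find r (correlation coefficient):

    1. We find the covariance:

       $$s_{xy} = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$

    2. Then:

       $$r = \frac{s_{xy}}{s_x s_y}$$

  2. We find the slope $m$:

     $$m = r \, \frac{s_y}{s_x}$$

  3. We find q using the mean values, because we are sure that they are on the regression line:

     $$q = \bar{y} - m \bar{x}$$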
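
The three steps can be sketched in Python as follows. The data is made up, the function name is hypothetical, and calling the slope `m` is my own assumption (the intercept `q` is the one from the notes).

```python
from statistics import mean, stdev

def regression_line(xs, ys):
    """Return (r, m, q) for the line y = m * x + q, going through r."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    # Step 1.1: sample covariance (n - 1 in the denominator).
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    # Step 1.2: correlation coefficient.
    r = cov / (stdev(xs) * stdev(ys))
    # Step 2: slope from r and the two standard deviations.
    m = r * stdev(ys) / stdev(xs)
    # Step 3: the point (x_bar, y_bar) lies on the line.
    q = y_bar - m * x_bar
    return r, m, q

# Hypothetical data lying exactly on y = 2x + 1.
r, m, q = regression_line([1, 2, 3, 4, 5], [3, 5, 7, 9, 11])
print(round(r, 3), round(m, 3), round(q, 3))
```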

Warning

If the exercise asks you to motivate why the regression line fits the data well, you say that $r^2$ is close to 1.


$r^2 \cdot 100\%$ of the variability of Y is explained by X.

Residual

Recitation 4


Sample space

It is the collection of all possible outcomes of an experiment.

A box contains four balls: one red, one blue, one yellow and one pink.

  • Consider an experiment that consists of drawing a ball from the box at random, replacing it, and drawing a second ball:
    S = {RR, RB, RY, RP, BR, BB, BY, BP, YR, YB, YY, YP, PR, PB, PY, PP}
  • Let A be the event that the first ball drawn is Yellow. List all outcomes in A:
    A = {YR, YB, YY, YP}
  • Let B be the event that both balls have the same color. List all the outcomes in B:
    B = {RR, BB, YY, PP}

Probabilities

We are in the same sample space as the examples above:

  1. Compute P(A) and P(Ac):
    P(A) = 4/16 = 0.25
    P(Ac) = 1 - 0.25 = 0.75
  2. Compute P(B) and P(Bc):
    P(B) = 1/4 = 0.25
    P(Bc) = 1 - 0.25 = 0.75
  3. Compute P(A and B):
    P(A and B) = P(A ∩ B) = 1/16 = 0.0625
  4. Compute P(A or B):
    P(A or B) = P(A ∪ B) = 0.25 + 0.25 - 0.0625 = 0.4375
  5. Compute P(A | B):
    P(A | B) = 0.0625/0.25 = 0.25
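
Since all 16 outcomes are equally likely, these probabilities can be checked by brute-force counting. A minimal sketch, with a hypothetical helper `prob` and exact fractions to avoid rounding:

```python
from fractions import Fraction
from itertools import product

colors = "RBYP"  # red, blue, yellow, pink
sample_space = [first + second for first, second in product(colors, repeat=2)]

def prob(event):
    """Probability of an event (a set of outcomes) under equally likely outcomes."""
    return Fraction(len(event), len(sample_space))

A = {o for o in sample_space if o[0] == "Y"}   # first ball is yellow
B = {o for o in sample_space if o[0] == o[1]}  # both balls have the same color

p_a, p_b = prob(A), prob(B)
p_a_and_b = prob(A & B)
p_a_or_b = prob(A | B)
p_a_given_b = p_a_and_b / p_b

print(p_a, p_b, p_a_and_b, p_a_or_b, p_a_given_b)
```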
Quick recap on conditional probability:

When we are searching for the probability of an event A given that another event B has already happened, we can restrict the sample space to the event B.


Now we search for the intersection of the two events in the sample space of B and we get this formula:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$


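Dependent and Independent Events

Two events are dependent if the occurrence of one of them changes the probability of the other one occurring.

In the conditional probability formula, the closer B is to being independent from A, the closer $P(A \mid B)$ gets to $P(A)$. If the two events are fully independent, $P(A \mid B) = P(A)$ and so $P(A \cap B) = P(A) \cdot P(B)$.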

P(N) = 175 / 590 = 0.297
P(S) = 570 / 590 = 0.966
P(S|N) = 160 / 175 = 0.914

  • Compute the probability that an individual did not wear a seatbelt (N) and survived (S):
    P(N and S) = P(N) x P(S|N) = 0.297 x 0.914 = 0.271
    P(N and S) if independent = P(N) x P(S) = 0.297 x 0.966 = 0.287

    Since those two are not equal, the events are not independent.
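
The same comparison, sketched with the counts that appear in the fractions above (590 people in total, 175 without a seatbelt, 570 survivors, 160 who survived without a seatbelt). The variable names are my own.

```python
total = 590
no_belt = 175
survived = 570
no_belt_and_survived = 160

p_n = no_belt / total
p_s = survived / total
p_s_given_n = no_belt_and_survived / no_belt

p_n_and_s = p_n * p_s_given_n       # multiplication rule, always valid
p_if_independent = p_n * p_s        # what we would get under independence

print(round(p_n_and_s, 3), round(p_if_independent, 3))
```

The two numbers differ, so N and S are not independent (at least for this table of counts).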

Recitation 5


Probability distributions

We are sometimes asked to find the mean of a probability distribution. That is the Expectation:

$$\mu = E[X] = \sum_{x} x \, P(x)$$

If we are asked to find the variance:

$$\sigma^2 = E[(X - \mu)^2] = \sum_{x} (x - \mu)^2 \, P(x)$$

Tldr

Literally the expectation of the squared difference of the points from the mean.

Note

If you are asked to complete the distribution, remember that the probabilities (the y values) must add up to 1.
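
A small sketch of all three ideas (completing the distribution, expectation, variance) on a made-up distribution. The values and probabilities are hypothetical, and exact fractions are used so nothing gets lost to rounding.

```python
from fractions import Fraction

# Hypothetical distribution with one probability left to fill in.
known = {0: Fraction(1, 5), 1: Fraction(1, 2)}
missing_value = 2

# The probabilities must add up to 1, so the missing one is what is left over.
distribution = dict(known)
distribution[missing_value] = 1 - sum(known.values())

# Expectation: sum of x * P(x).
expectation = sum(x * p for x, p in distribution.items())
# Variance: expectation of the squared difference from the mean.
variance = sum((x - expectation) ** 2 * p for x, p in distribution.items())

print(distribution[missing_value], expectation, variance)
```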

Normal curve and Z-Score

Some exercises may ask you to calculate the probability in an interval of the Normal distribution.

In principle you would do this with an integral of the normal density, but that integral has no closed-form solution, which is why we use tables instead.

Anyway, we have this exercise:

In a population the vehicle speed distribution is well approximated by a Normal curve with mean 50 and standard deviation 15.

  • Compute the probability that a randomly selected vehicle speed is greater than 73

What is a z-score and what's its purpose:

Basically the z-score is how many standard deviations our value is away from the mean: $z = \frac{x - \mu}{\sigma}$.
The z-score is useful because it is standardized for every normal curve.

Basically if we get an exercise like the one above, where we would need to use an integral, we have a table of ready-to-go values, the z-table.

The z-table:

The z-table assigns to every z-score the area under the curve up to that z-score (the area to the left of it).
The table is computed from the standard normal curve, but the z-score is standardized, so if our distribution follows a normal curve we can use the table.

Getting back to the exercise:

Compute the probability that a randomly selected vehicle speed is greater than 73:

  1. We compute the z-score:

     $$z = \frac{73 - 50}{15} \approx 1.53$$

  2. Now we've got to use the z-table to find the area corresponding to the z-score of 1.53:

We use the entry for -1.53 because if we used 1.53 we would get all the area to the left of 1.53, and we want the area to the right. We can use the negative z because the normal distribution is symmetric:

$$P(X > 73) = P(Z > 1.53) = P(Z < -1.53) \approx 0.063$$

Our result is 0.063.

Info

We could have also computed the complement of the area, instead of flipping the sign of the z-score: $1 - 0.937 = 0.063$.

This is because the total area of the normal curve is 1.
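
If a z-table is not at hand, Python's standard library can stand in for it. A sketch for the vehicle speed exercise, showing both routes (complement and symmetry); the variable names are mine.

```python
from statistics import NormalDist

speeds = NormalDist(mu=50, sigma=15)

# z-score: how many standard deviations 73 is above the mean.
z = (73 - speeds.mean) / speeds.stdev

# Route 1: complement of the area to the left of 73.
p_complement = 1 - speeds.cdf(73)
# Route 2: symmetry, area to the left of -z on the standard normal curve.
p_symmetry = NormalDist().cdf(-z)

print(round(z, 2), round(p_complement, 3), round(p_symmetry, 3))
```

Both routes give the same area; tiny differences from the table value come from rounding z to two decimals.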

Binomial distribution

A binomial distribution can be thought of as simply the probability of a SUCCESS or FAILURE outcome in an experiment or survey that is repeated multiple times.

Binomial distributions must meet the following three criteria:

  1. The number of observations or trials is fixed.
    In other words, a probability like "at least one tails" only makes sense once you fix how many times you toss. This is common sense:
    - If you toss a coin once, your probability of getting a tails is 50%.
    - If you toss a coin 20 times, your probability of getting at least one tails is very, very close to 100%.
  2. Each observation or trial is independent.
    In other words, none of your trials have an effect on the probability of the next trial.

  3. The probability of success (tails, heads, fail or pass) is exactly the same from one trial to another.

Imagine we flip a coin 5 times. There are $2^5 = 32$ equally likely outcomes. Now we want to know how many outcomes have 3 heads in them, and so the probability of getting exactly 3 heads.
The formula to find out is:

$$P(x) = \binom{n}{x} p^x q^{n - x}, \qquad \binom{n}{x} = \frac{n!}{x!\,(n - x)!}$$

Where:

  • $n$ is the number of trials or the size of the sample.
  • $x$ is the number of successes, or in this case the number of heads.
  • $p$ is the probability of success in one trial.
  • $q = 1 - p$, or the probability of failure in one trial.
  • $P(x)$ is the probability of getting x successes (here 3 heads) in $n$ trials with probability of success $p$ per trial.

For the coin: $P(3) = \binom{5}{3} \cdot 0.5^3 \cdot 0.5^2 = \frac{10}{32} \approx 0.31$.

Mean of a binomial distribution is given by: $\mu = np$
Variance of a binomial distribution is given by: $\sigma^2 = npq$
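
The coin example can be checked with a short sketch of the binomial formula. The function name `binomial_pmf` is my own; `math.comb` counts the outcomes with exactly x successes.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

n, p = 5, 0.5
outcomes_with_3_heads = comb(n, 3)      # how many of the 32 outcomes have 3 heads
p_three_heads = binomial_pmf(3, n, p)   # probability of exactly 3 heads

mean_heads = n * p                      # mean of the binomial
variance_heads = n * p * (1 - p)        # variance of the binomial

print(outcomes_with_3_heads, p_three_heads, mean_heads, variance_heads)
```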

Hint

With big sample sizes the binomial distribution is well approximated by a normal distribution, so we can use z-scores to find areas under the curve.

Recitation 6


Sampling distributions

When people enter an Apple store, a proportion $p$ of them buy a product before leaving. This is the real proportion: it is the ground truth, and here we assume we have it.

Imagine we sample the population and try to obtain p from the samples. Now the estimate becomes uncertain: it is a random variable, the sample proportion $\hat{p}$.

According to the Central Limit Theorem, for large samples, the sample proportion is approximately normally distributed, with mean:

$$\mu_{\hat{p}} = p$$

and standard deviation of a proportion:

$$\sigma_{\hat{p}} = \sqrt{\frac{p(1 - p)}{n}}$$

Where:

  • $p$ is the proportion/statistic of something.
  • $n$ is the sample size.
  • $\mu_{\hat{p}}$ is the mean of the distribution of sampled proportions.
  • $\sigma_{\hat{p}}$ is the standard deviation of the sampled proportions.

If we are investigating a mean (not a proportion) the formula for standard deviation is:

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$

Warning

Sometimes we want to compute the probability of successes being more than a certain number.

We know that we can get the area under a curve by using the z-scores, but this distribution only approximates a normal distribution when using a large n.

So when we have a small n we need to go sideways:

  • If our configuration is also a binomial distribution, we can use that formula to compute every single discrete probability.
Example

For the population of individuals who own an iPhone, suppose p = 0.25 is the proportion that has a given app.

  1. For a random sample of size n = 4, find the mean and the standard deviation of the sampling distribution of the sample proportion:

     $$\mu_{\hat{p}} = p = 0.25, \qquad \sigma_{\hat{p}} = \sqrt{\frac{0.25 \cdot 0.75}{4}} \approx 0.217$$

  2. Find the probability that the proportion of having the app is at least 0.75 when n = 4.

    Here the sample size is too small, so we can't use the normal approximation.
    0.75 of 4 = 3, so we need the probability that at least 3 people have the app.
    We do that by summing the probabilities that 3 people have the app and 4 people have the app.
    Since those probabilities are discrete and there are only 2 possible outcomes per trial, we can use the binomial distribution formula:

     $$P(X \ge 3) = \binom{4}{3} 0.25^3 \cdot 0.75 + \binom{4}{4} 0.25^4 \approx 0.047 + 0.004 = 0.051$$
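
Both parts of the iPhone example can be sketched in Python. The variable names are my own; the second part sums the binomial probabilities for 3 and 4 owners, as described above.

```python
from math import comb, sqrt

n, p = 4, 0.25

# Part 1: mean and standard deviation of the sample proportion.
mean_p_hat = p
sd_p_hat = sqrt(p * (1 - p) / n)

# Part 2: P(sample proportion >= 0.75) = P(at least 3 of the 4 have the app).
p_at_least_3 = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in (3, 4))

print(mean_p_hat, round(sd_p_hat, 3), round(p_at_least_3, 3))
```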

Example

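In the population, IQ scores are normally distributed with mean µ = 100 and variance σ² = 15. Suppose we draw a random sample of 25 individuals from the population and measure their IQ scores.

  1. Compute the probability of observing a sample mean between 98 and 102 when drawing a sample of 25 individuals:

The standard deviation of the sample mean for a sample of size 25 is given by:

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{\sigma}{\sqrt{25}} = \frac{\sigma}{5}$$

The z-scores for 98 and 102 are:

$$z_1 = \frac{98 - 100}{\sigma_{\bar{x}}}, \qquad z_2 = \frac{102 - 100}{\sigma_{\bar{x}}}$$

The areas given by the z-tables for the z-scores are:

$$P(Z < z_1) \quad \text{and} \quad P(Z < z_2)$$

The area between the two z-scores is given by:

$$P(98 < \bar{X} < 102) = P(Z < z_2) - P(Z < z_1)$$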
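
The whole chain (standard deviation of the sample mean, z-scores, areas, difference) fits in one small function. This is a sketch with a hypothetical function name. It takes the stated variance of 15 at face value; since IQ is usually quoted with a standard deviation of 15, the second call shows what changes under that assumption.

```python
from math import sqrt
from statistics import NormalDist

def prob_sample_mean_between(low, high, mu, sigma, n):
    """P(low < sample mean < high) when the population is normal."""
    se = sigma / sqrt(n)            # standard deviation of the sample mean
    z_low = (low - mu) / se
    z_high = (high - mu) / se
    std_normal = NormalDist()       # standard normal curve (the z-table)
    return std_normal.cdf(z_high) - std_normal.cdf(z_low)

# Variance 15 as written in the exercise, so sigma = sqrt(15).
p_stated = prob_sample_mean_between(98, 102, mu=100, sigma=sqrt(15), n=25)
# Assumption: the exercise meant a standard deviation of 15 instead.
p_if_sd_15 = prob_sample_mean_between(98, 102, mu=100, sigma=15, n=25)

print(round(p_stated, 3), round(p_if_sd_15, 3))
```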

Recitation 7


Standard Error

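It's the standard deviation of an estimate across samples: roughly, how far the estimate from one sample typically lands from the true value.

For the sample mean it is:

$$SE_{\bar{x}} = \frac{s}{\sqrt{n}}$$

For the sample proportion it is:

$$SE_{\hat{p}} = \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$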

What is the Difference between Standard Error and Standard Deviation?

Standard error and standard deviation are both measures of variability, but standard deviation is a descriptive statistic that can be calculated from sample data, while standard error is an inferential statistic that can only be estimated.

Confidence interval

Confidence Level

The confidence level is the overall capture rate if the method is used many times. The sample mean will vary from sample to sample, but the method estimate ± margin of error is used to get an interval based on each sample.
C% of these intervals capture the unknown population mean 𝜇.
In other words, the actual mean will be located within the interval C% of the time.

The population mean for a certain variable is estimated by computing a confidence interval for that mean.

The formula for the confidence interval is:

$$\text{point estimate} \pm \text{margin of error}$$

In order to find the confidence interval, we must find the margin of error first:

$$\text{margin of error} = z^* \cdot SE$$

Formula Explanation

For each C% there is a specific z-score, that gives you the bounds of the interval.

The margin of error is just the un-normalized bound, because we are converting the z-score to a real value.

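Where to find values for $z^*$?

You can find the values for $z^*$ (the z-score multiplier for the chosen confidence level) in the c-table or z-table. For example, 95% confidence corresponds to $z^* = 1.96$.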

Ex. When a General Social Survey asked 1326 subjects, "Do you believe in science?", the proportion who answered yes was 0.82.

Construct the 95% confidence interval.

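The sample proportion is equal to 0.82, now we just need the standard error in order to calculate the margin of error:

$$SE = \sqrt{\frac{0.82 \cdot 0.18}{1326}} \approx 0.011, \qquad \text{margin of error} = 1.96 \cdot 0.011 \approx 0.022$$

Why this formula?

This is the standard error formula for the sample proportion:

$$SE_{\hat{p}} = \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$

It's different from the one we use for the sample mean:

$$SE_{\bar{x}} = \frac{s}{\sqrt{n}}$$

So the confidence interval would be

$$0.82 \pm 0.022 = [0.798, 0.842]$$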

How to interpret

We are 95% confident that between 79.8% and 84.2% of people believe in science.
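
A sketch of the same interval in Python, with a hypothetical function name. It keeps the standard error unrounded, so the bounds can differ in the third decimal from the 79.8% and 84.2% quoted above.

```python
from math import sqrt
from statistics import NormalDist

def proportion_confidence_interval(p_hat, n, confidence=0.95):
    """Large-sample confidence interval for a proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    se = sqrt(p_hat * (1 - p_hat) / n)                  # standard error
    margin = z * se                                     # margin of error
    return p_hat - margin, p_hat + margin

low, high = proportion_confidence_interval(0.82, 1326)
print(round(low, 3), round(high, 3))
```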

Hint

If sample size increases, the margin of error decreases, and thus the CI becomes narrower.

Hint

Describe the effect of standard deviation, sample size and α on the confidence interval.

  • lower standard deviation → lower margin of error → narrower CI
  • higher standard deviation → higher margin of error → wider CI
  • lower sample size → higher margin of error → wider CI
  • higher sample size → lower margin of error → narrower CI
  • lower α → higher level of confidence 1-α →higher margin of error → wider CI
  • higher α → lower level of confidence 1-α → lower margin of error → narrower CI

Significance Level

The significance level is a threshold that determines whether a study result can be considered statistically significant after performing the statistical tests.

A significance level $\alpha$ of 0.05 indicates a 5% risk of concluding that the true population mean differs from the hypothesized one when there is no actual difference (so, the probability of being misled by an unlucky sample).

We expect to obtain a sample mean that falls in the critical region 5% of the time.

Multiplier for significance level

It is the z-value for the bounds of the critical regions.
If $\alpha = 0.05$, then a single tail area measures $\alpha / 2 = 0.025$.

So we need to find the z-value for the area 0.025 and that will be our "multiplier": $z = \pm 1.96$.
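
Looking up the multiplier is just the z-table read backwards: from an area to a z-value. A sketch with a hypothetical helper name:

```python
from statistics import NormalDist

def two_sided_multiplier(alpha):
    """z-value that leaves an area of alpha / 2 in each tail."""
    return NormalDist().inv_cdf(1 - alpha / 2)

z_05 = two_sided_multiplier(0.05)
z_01 = two_sided_multiplier(0.01)
print(round(z_05, 2), round(z_01, 2))
```

A lower α gives a larger multiplier, which is why the confidence interval gets wider.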

t-distribution

It's like a normal curve but with heavier tails (it gets closer to the normal curve as the sample size grows).
You must use the t-distribution table (instead of the z-table) when the population standard deviation is not known and the sample size is small (n<30).

General Correct Rule: If σ is not known, then using t-distribution is correct. If σ is known, then using the normal distribution is correct.

Danger

Do exercise 2 and the last ones. We did not do exercises on the t-distribution.

Recitation 8


Significance Test

I will explain using an exercise:

In a sample of 402 Tor Vergata first-year students, 174 are enrolled in the Statistics course.

Is the proportion of students enrolled in the Statistics course in the population of all Tor Vergata first-year students different from 0.50 at the significance level α = 0.05?

1. Assumptions

The sampling distribution of the sample proportion is approximately normal because the sample size is large (here $n p_0 = n(1 - p_0) = 201$).

2. Hypothesis

We are given a hypothesis:

$$H_0: p = 0.50 \qquad H_a: p \neq 0.50$$

Where:

  • $H_0$ is the actual hypothesis, or null hypothesis.
  • $H_a$ is the alternative hypothesis.

In a significance test, the null hypothesis is presumed to be true unless the data give strong evidence against it.

3. Test statistic

A test statistic measures how far the point estimate falls from the parameter value given in the null hypothesis. The result is the number of standard errors between the two.

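First, we construct the normal curve considering the hypothesis. Under $H_0$ it is centered at $p_0 = 0.50$ with standard error:

$$se_0 = \sqrt{\frac{p_0 (1 - p_0)}{n}} = \sqrt{\frac{0.5 \cdot 0.5}{402}} \approx 0.025$$

Then we take the sample:

$$\hat{p} = \frac{174}{402} \approx 0.433$$

We can already see that this sample proportion doesn't really agree with our hypothesis.
Mathematically, to reject the hypothesis we need to check if the sample mean/proportion lands beyond the significance level threshold.

In order to do just that, we need the z-score for the sample proportion, also called the test statistic:

$$z = \frac{\hat{p} - p_0}{se_0} \approx -2.69 \quad \text{(computed with unrounded values)}$$

Now we compute the areas under the curve to determine whether the sample proportion exceeds the threshold or not.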

4. p-value

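This is the area under the curve in the tails beyond the test statistic: here, the area to the left of $z \approx -2.69$ (about 0.0036), doubled to about 0.007 because the alternative is two-sided. We can later compare it with the significance level to see if we really are out of bounds.

Why x2?

Because the significance level is the area under both tails of the distribution.
So we can either compare $2 \times$ (one-tail area) with $\alpha$, or the one-tail area with $\alpha / 2$.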

5. Conclusion

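Once we have the p-value, we proceed to either reject or not reject the null hypothesis by comparing the two areas:

  • If p-value $\le \alpha$, we reject $H_0$
  • If p-value $> \alpha$, we do not reject $H_0$ (loosely, we "accept" it)

In this case $0.007 < 0.05$, so we reject the hypothesis.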
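
The five steps can be condensed into a short sketch for the Tor Vergata exercise. The function name is hypothetical, and the test is the two-sided z-test for one proportion described above.

```python
from math import sqrt
from statistics import NormalDist

def one_proportion_z_test(successes, n, p0):
    """Two-sided z-test of H0: p = p0. Returns (z, p_value)."""
    p_hat = successes / n
    se0 = sqrt(p0 * (1 - p0) / n)   # standard error assuming H0 is true
    z = (p_hat - p0) / se0          # test statistic
    # Two-sided p-value: the area beyond |z| in one tail, times 2.
    p_value = 2 * NormalDist().cdf(-abs(z))
    return z, p_value

z, p_value = one_proportion_z_test(174, 402, 0.50)
reject_h0 = p_value <= 0.05

print(round(z, 2), round(p_value, 4), reject_h0)
```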

Recitation 9


Simple regression equation

A regression equation describes how the mean value of a Y-variable relates to specific values of the X-variable:

$$\mu_y = \beta_0 + \beta_1 x$$

SSE

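SSE stands for "Sum of Squared Errors", it is calculated as follows:

$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Where:

  • $\hat{y}_i$ is the predicted value for $x_i$
  • $y_i$ is the ground truth for $x_i$

We are basically computing a sum of the squared residuals, once we have our prediction:

$$\hat{y}_i = b_0 + b_1 x_i$$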

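How to find $b_0$ and $b_1$

If we want to find the two parameters, we have to minimize the SSE.

We do that by computing $b_1$ and $b_0$ as:

$$b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}, \qquad b_0 = \bar{y} - b_1 \bar{x}$$

They are called the "least squares estimates" of $\beta_0$ and $\beta_1$.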

Warning

We use the notation $\beta_0$ and $\beta_1$ when we are talking about population/ground truth parameters.

We use the notation $b_0$ and $b_1$ when we are talking about fitted/predicted parameters.

MSE

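If we are trying to estimate how good our model is, we can use the "Mean Squared Error":

$$MSE = \frac{SSE}{n - p}$$

Where:

  • $n$ is the number of observations
  • $p$ is the number of parameters, in this case 2.
Hint

The MSE is the sample variance of the errors and estimates $\sigma^2$:

$$s^2 = MSE = \frac{\sum (y_i - \hat{y}_i)^2}{n - 2}$$

Of course, then the standard deviation of errors is computed as:

$$s = \sqrt{MSE}$$

Which is the sample standard deviation of the errors (residuals) from the regression line.
It can be interpreted as the average deviation of individuals from the sample regression line.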
It can be interpreted as the average deviation of individuals from the sample regression line.

SST

It is the total sum of squares:

$$SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$$

It is basically the sum of all the squared deviations from the mean.

$R^2 = \frac{SST - SSE}{SST}$ is the square of the correlation coefficient.

Hint

It is interpreted as the fraction of variation in y that is explained by the fitted regression equation. It is often converted to a percentage.
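
To tie the pieces of this recitation together, here is a sketch that fits a line by least squares and then computes SSE, MSE, SST and the fraction of explained variation. The data is made up, and the function name and the symbols `b0`, `b1` are my own choices for the fitted intercept and slope.

```python
from math import sqrt
from statistics import mean

def fit_simple_regression(xs, ys):
    """Least squares fit of y = b0 + b1 * x, plus the usual summary numbers."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx                  # slope that minimizes the SSE
    b0 = y_bar - b1 * x_bar         # intercept: line passes through the means
    predictions = [b0 + b1 * x for x in xs]
    sse = sum((y - y_hat) ** 2 for y, y_hat in zip(ys, predictions))
    sst = sum((y - y_bar) ** 2 for y in ys)
    mse = sse / (len(xs) - 2)       # 2 estimated parameters: b0 and b1
    return {"b0": b0, "b1": b1, "sse": sse, "sst": sst,
            "mse": mse, "s": sqrt(mse), "r_squared": 1 - sse / sst}

# Hypothetical data, only for illustration.
fit = fit_simple_regression([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print({name: round(value, 3) for name, value in fit.items()})
```

With this data 60% of the variation in y is explained by the fitted line.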